Spectro-Temporal Modelling with Time-Frequency LSTM and Structured Output Layer for Voice Conversion
نویسندگان
چکیده
From speech, speaker identity can be mostly characterized by the spectro-temporal structures of spectrum. Although recent researches have demonstrated the effectiveness of employing long short-term memory (LSTM) recurrent neural network (RNN) in voice conversion, traditional LSTM-RNN based approaches usually focus on temporal evolutions of speech features only. In this paper, we improve the conventional LSTM-RNN method for voice conversion by employing the two-dimensional time-frequency LSTM (TFLSTM) to model spectro-temporal warping along both time and frequency axes. A multi-task learned structured output layer (SOL) is afterward adopted to capture the dependencies between spectral and pitch parameters for further improvement, where spectral parameter targets are conditioned upon pitch parameters prediction. Experimental results show the proposed approach outperforms conventional systems in speech quality and speaker similarity.
منابع مشابه
Phoneme Classification Using Temporal Tracking of Speech Clusters in Spectro-temporal Domain
This article presents a new feature extraction technique based on the temporal tracking of clusters in spectro-temporal features space. In the proposed method, auditory cortical outputs were clustered. The attributes of speech clusters were extracted as secondary features. However, the shape and position of speech clusters change during the time. The clusters temporally tracked and temporal tra...
متن کاملResidual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition
In this paper, a novel architecture for a deep recurrent neural network, residual LSTM is introduced. A plain LSTM has an internal memory cell that can learn long term dependencies of sequential data. It also provides a temporal shortcut path to avoid vanishing or exploding gradients in the temporal domain. The residual LSTM provides an additional spatial shortcut path from lower layers for eff...
متن کاملReducing the Computational Complexity of Two-Dimensional LSTMs
Long Short-Term Memory Recurrent Neural Networks (LSTMs) are good at modeling temporal variations in speech recognition tasks, and have become an integral component of many state-of-the-art ASR systems. More recently, LSTMs have been extended to model variations in the speech signal in two dimensions, namely time and frequency [1, 2]. However, one of the problems with two-dimensional LSTMs, suc...
متن کاملAttention Based CLDNNs for Short-Duration Acoustic Scene Classification
Recently, neural networks with deep architecture have been widely applied to acoustic scene classification. Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory Networks (LSTMs) have shown improvements over fully connected Deep Neural Networks (DNNs). Motivated by the fact that CNNs, LSTMs and DNNs are complimentary in their modeling capability, we apply the CLDNNs (Convolutiona...
متن کاملDeep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion
Emotional voice conversion aims at converting speech from one emotion state to another. This paper proposes to model timbre and prosody features using a deep bidirectional long shortterm memory (DBLSTM) for emotional voice conversion. A continuous wavelet transform (CWT) representation of fundamental frequency (F0) and energy contour are used for prosody modeling. Specifically, we use CWT to de...
متن کامل